Abstract: Big data is the future of the IT industry. This paper presents a methodology, an ETL process, for analysing big
data with the Hadoop ecosystem. Analysing big data extracts business value from raw data and helps organisations gain a
competitive advantage. Data in web applications and social networks is growing drastically, and such data is referred to as
Big Data. Retrieving these datasets takes a great deal of time, and conventional approaches perform poorly when analysing
them. To overcome this problem, Hive queries integrated with Hadoop are used to generate report analyses for thousands of
datasets. The objective is to store the data persistently along with its history and to perform report analysis on it. The
main aim of the system is to improve performance by parallelizing operations such as data loading, index building and query
evaluation. HDFS is used to store the data after the MapReduce operations, and execution time decreases as the number of
nodes increases. Performance is evaluated in terms of execution time and number of nodes.

Keywords: Big Data, Hadoop, HDFS.

I. INTRODUCTION 

Generating information requires a massive collection of data. The data can range from simple numerical figures and text
documents to more complex information such as spatial data, multimedia data, and hypertext documents. To take full
advantage of data, retrieval alone is not enough; a tool is needed for automatic summarization of data, extraction of the
essence of the stored information, and discovery of patterns in raw data. With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly important to develop powerful tools for analysing and interpreting
such data and for extracting interesting knowledge that can help in decision-making. The answer to all of the above is
'Data Mining'.

Data mining is the extraction of hidden predictive information from large databases; it is a powerful technology with great
potential to help organizations focus on the most important information in their data warehouses [1][2][3][4]. Data mining
tools predict future trends and behaviours, helping organizations to make proactive, knowledge-driven decisions [2]. The
automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective
tools typical of decision support systems. Data mining tools can answer questions that traditionally were too time-consuming
to resolve. They search databases for hidden patterns, finding predictive information that experts may miss because it lies
outside their expectations.

Data mining, popularly known as Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit,
previously unknown and potentially useful information from data in databases [3][5]. Although data mining and KDD are
frequently treated as synonyms, data mining is actually one part of the knowledge discovery process.

1.1 Knowledge Discovery Process 

Data mining is one of the tasks in the process of knowledge discovery from databases. The KDD process comprises the
following steps:

• Data cleaning: It is also known as data cleansing; in this phase noisy and irrelevant data are removed from the
collection.

• Data integration: In this stage, multiple data sources, often heterogeneous, are combined in a common source. 

• Data selection: The data relevant to the analysis is decided on and retrieved from the data collection. 

• Data transformation: It is also known as data consolidation; in this phase the selected data is transformed into forms 
appropriate for the mining procedure. 

• Data mining: It is the crucial step in which clever techniques are applied to extract potentially useful patterns [1][3].

• Pattern evaluation: In this step, interesting patterns representing knowledge are identified based on given measures. 

• Knowledge representation: It is the final phase in which the discovered knowledge is visually presented to the user. 
This essential step uses visualization techniques to help users understand and interpret the data mining results. 
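
The KDD steps above can be sketched as a small pipeline. The following Python example is only an illustration under stated assumptions: the record fields (customer, amount), the two in-memory sources, and the thresholds are hypothetical and do not come from the work described here.

```python
from collections import Counter

# Hypothetical raw sources (assumption: two small in-memory lists stand in
# for heterogeneous data stores).
source_a = [{"customer": "c1", "amount": "1200"}, {"customer": "c2", "amount": ""}]
source_b = [{"customer": "c3", "amount": "300"}, {"customer": "c1", "amount": "2500"}]


def clean(records):
    """Data cleaning: drop records whose amount is missing or not numeric."""
    return [r for r in records if r["amount"].strip().isdigit()]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for src in sources for r in src]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: convert amounts from text to integers."""
    return [{**r, "amount": int(r["amount"])} for r in records]


def mine(records, threshold):
    """Data mining: count large transactions per customer (a toy pattern)."""
    return Counter(r["customer"] for r in records if r["amount"] >= threshold)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness measure."""
    return {k: v for k, v in patterns.items() if v >= min_count}


def run_kdd():
    """Run the steps in the order listed in the text."""
    integrated = integrate(clean(source_a), clean(source_b))
    data = transform(select(integrated, ["customer", "amount"]))
    return evaluate(mine(data, threshold=1000), min_count=2)


if __name__ == "__main__":
    # Knowledge representation: present the result to the user.
    print(run_kdd())
```

Each function corresponds to one bullet, so a step can be replaced (for example, a different mining technique) without touching the others.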

II. Related Work

D. Abadi analyses large-scale data analysis with traditional DBMSs. Data management is scalable, but data is replicated;
this replication provides fault tolerance [1]. Abadi and Avi Silberschatz analyse massive datasets on very large clusters
within the HadoopDB architecture for real-world applications such as business data warehousing. It approaches parallel
databases in performance, but according to this analysis it still lacks scalability and consumes a huge amount of execution
time [4]. Farah Habib Chanchary analyses how large datasets are stored efficiently across clusters of machines in cloud
storage systems, so that, with the same information held on more than one system, the datasets remain usable even if one
system loses power [3]. According to IBM, unstructured and multi-structured data make up about 80% of the data in an average
organization (Savvas, 2011). Given average annual data growth of 59% (Pettey & Goasduff, 2011), this percentage will likely
be much higher in a few years. Volume is not the only problem; variety and velocity are also issues that need attention
(Russom, 2011). This phenomenon is called "big data" and was identified as one of the biggest IT trends for 2012
(Pettey, 2012) [5][6][7].

Time to market and innovation in new products are nowadays key factors for enterprises. Data warehouses and BI support the
process of making business decisions. These instruments allow the discovery of hidden patterns, such as unknown relations
between certain facts or entities. This causes an important change: in the past, the question to be run against the system
was already known when the data was collected, whereas today it is common practice to capture all the data and hold it for
questions that will be asked in the future [9]. That is why Big Data is a fast-growing topic in information science. To
tackle the challenges of Big Data, a new type of technology has emerged. Most of these technologies are distributed across
many machines and have been grouped under the term "NoSQL", which is commonly expanded to "Not Only SQL". In some ways these
new technologies are more complex than traditional databases, and in other ways they are simpler. No solution fits all
situations, so some compromises must be made; this point is also analysed in this work. These new systems can scale to
vastly larger sets of data, but using them also requires new sets of techniques. Many of these technologies were pioneered
by two big companies, Google and Amazon. The most popular is probably the MapReduce computation framework introduced by
Google in 2004. Amazon created an innovative distributed key-value store called Dynamo. The open source community responded
in the following years with Hadoop (a free implementation of MapReduce), HBase, MongoDB, Cassandra, RabbitMQ and countless
other projects [Marl2].

Heterogeneous mixture learning is an advanced technology used in big data analysis. The work in [8] introduces the
difficulties inherent in heterogeneous mixture data analysis, the basic concept of heterogeneous mixture learning, and the
results of a demonstration experiment on electricity demand prediction. As big data analysis grows in importance,
heterogeneous mixture data mining technology is also expected to play a significant role in the market, and its range of
application is expected to expand further in the future [8].


III. Methodology used 


3.1 The Hadoop Platform 


Hadoop is an open-source ecosystem used for storing, managing and analysing a large volume of data, designed to allow 
distributed processing of data sets across thousands of machines. 
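
To make the idea of distributed processing concrete, the following is a minimal single-process sketch of the MapReduce model that Hadoop implements. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input splits are assumptions made for illustration, and in a real cluster the splits would be processed on different machines.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across the map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run one map task per split, then shuffle and reduce the results."""
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: two splits, as if stored on two different nodes.
    splits = [["big data hadoop"], ["hadoop hive", "big data"]]
    print(word_count(splits))
```

Because each call to map_phase depends only on its own split, the map tasks are independent of one another, which is what allows a cluster to run them in parallel.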





Fig 1. Hadoop Ecosystem 

[Figure: HDFS architecture diagram. The Name Node holds the metadata (name, replicas, ...), e.g. /home/foo/data, 3, ...]

Fig 2. HDFS Architecture

A cluster running Hadoop means running a set of daemons on the different servers of the network. The daemons include: 


• Name Node: The master of HDFS, which directs the slave Data Node daemons. It is recommended to run the Name Node on its
own machine. The drawback of the Name Node is that it is a single point of failure.

• Data Node: Responsible for reading and writing HDFS blocks to actual files on the local file system. A Data Node may
communicate with other Data Nodes to replicate its data for redundancy.

• Secondary Name Node: An assistant daemon that monitors the state of HDFS and periodically checkpoints the Name Node's
metadata; it is not a standby Name Node.

• Job Tracker: Manages MapReduce job execution. The Job Tracker daemon determines the execution plan and assigns nodes to
the different tasks. There is only one Job Tracker per Hadoop cluster.

• Task Tracker: Manages the execution of individual tasks on its machine and communicates with the Job Tracker to obtain
task requests.
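
The division of labour between the Name Node and the Data Nodes can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: the 64 MB block size (the default in early Hadoop releases) and the replication factor of 3 are configurable in practice, the Data Node names are hypothetical, and the round-robin placement stands in for the rack-aware placement policy that HDFS actually uses.

```python
import math


def plan_blocks(file_size, block_size, data_nodes, replication=3):
    """Return toy Name Node metadata: block index -> list of Data Nodes.

    Simplified round-robin placement; real HDFS uses a rack-aware policy.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of Data Nodes")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has a replica on a live node."""
    return all(any(n != failed_node for n in nodes) for nodes in placement.values())


if __name__ == "__main__":
    mb = 1024 * 1024
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical Data Node names
    meta = plan_blocks(200 * mb, 64 * mb, nodes)
    print(meta)
    print(readable_after_failure(meta, "dn2"))
```

In this model a 200 MB file is split into four blocks, and because each block lives on three nodes, losing any single Data Node leaves the file readable.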

3.2 FileZilla 


FileZilla is a free, cross-platform FTP application consisting of FileZilla Client and FileZilla Server. Client binaries are
available for Windows, Linux, and Mac OS X. It supports FTP, SFTP, and FTPS.

FileZilla is used here to move files from the local Windows system to the Cloudera directory using the FileZilla client.


[Screenshot: FileZilla client connected over SFTP to the Cloudera machine as user cloudera. The local file
flume-sources-1.0-SNAPSHOT.jar is transferred to the remote directory /home/cloudera, and the message log reports
"File transfer successful".]

Fig 3. Transferring files from the Windows machine to Cloudera

IV. Present work 


4.1 Hive Component of Hadoop 

Hive supports structured data, e.g. by creating tables, and is extensible to unstructured data.

• Hive query for creating tables

Create table user (Userid int, Age int, Gender char) row format delimited fields;

• Hive Query for inserting/loading data into table 

Load data Local inpath '/users/local/users.txt' into Table user;
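
When Hive reads a delimited text table, each line is split on the field delimiter and each field is converted to the declared column type; a field that cannot be converted is returned as NULL rather than raising an error. The following Python sketch mimics that behaviour for the user table above. The comma delimiter and the sample lines are assumptions, since the query does not state the delimiter, and the function names are hypothetical.

```python
# Column names and types taken from the CREATE TABLE statement above.
SCHEMA = [("Userid", int), ("Age", int), ("Gender", str)]


def parse_line(line, delimiter=","):
    """Split one text line and cast each field; failed casts become None."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, cast) in enumerate(SCHEMA):
        raw = fields[i] if i < len(fields) else None  # missing field
        try:
            row[name] = cast(raw) if raw is not None else None
        except ValueError:
            row[name] = None  # mirrors Hive returning NULL for a bad field
    return row


def load(lines, delimiter=","):
    """Rough analogue of LOAD DATA followed by SELECT *: parse every line."""
    return [parse_line(line, delimiter) for line in lines]


if __name__ == "__main__":
    sample = ["1,34,M\n", "2,abc,F\n", "3,27\n"]  # hypothetical file contents
    for row in load(sample):
        print(row)
```

The schema is applied when the data is read, not when it is loaded, which is why a malformed line does not make the load itself fail.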

4.2 Flow Chart 



Fig 4. Outline of Present Work 

V. Results


Time taken: 0.045 seconds
hive> CREATE TABLE LOAN_STATS_YEAR07_YEAR11
    > (ID INT,
    > MEMBER_ID INT,
    > LOAN_AMNT DOUBLE,
    > FUNDED_AMNT DOUBLE,
    > FUNDED_AMNT_INV DOUBLE,
    > TERM STRING,
    > INT_RATE STRING,
    > INSTALLMENT DOUBLE,
    > GRADE STRING,
    > SUB_GRADE STRING,
    > EMP_TITLE STRING,
    > EMP_LENGTH STRING,
    > home_ownership STRING,


Fig 5. Create table loan_stats_year07_year11


    > num_tl_90g_dpd_24m DOUBLE,
    > num_tl_30dpd DOUBLE,
    > num_tl_120dpd_2m DOUBLE,
    > num_il_tl DOUBLE,
    > mo_sin_old_il_acct STRING,
    > num_actv_rev_tl DOUBLE,
    > mo_sin_old_rev_tl_op STRING,
    > mo_sin_rcnt_rev_tl_op STRING,
    > total_rev_hi_lim DOUBLE,
    > num_rev_tl_bal_gt_0 DOUBLE,
    > num_op_rev_tl DOUBLE,
    > tot_coll_amt DOUBLE,
    > policy_code STRING
    > )
    > row format delimited fields terminated by
Time taken: 0.745 seconds
hive>


Fig 6. Create table loan_stats_year07_year11


[Console screenshot: cloudera-demo-0.3.7 virtual machine running in VMware Player]

Time taken: 0.745 seconds
hive> CREATE TABLE LOAN_STATS_YEAR12_YEAR13
(ID INT,
MEMBER_ID INT,
LOAN_AMNT DOUBLE,
FUNDED_AMNT DOUBLE,
FUNDED_AMNT_INV DOUBLE,
TERM STRING,
INT_RATE STRING,
INSTALLMENT DOUBLE,
GRADE STRING,
SUB_GRADE STRING,
EMP_TITLE STRING,
EMP_LENGTH STRING,
home_ownership STRING,
annual_inc DOUBLE,
is_inc_v STRING,
accept_d STRING,
expd STRING,
listd STRING,
issued STRING,
loan_status STRING,
pymnt_plan STRING,
url STRING,
description STRING,
purpose STRING,
title STRING,
addr_city STRING,
addr_state STRING,


Fig 7. Create table loan_stats_year12_year13

Fig 8. Create table loan_stats_year12_year13


Time taken: 0.236 seconds
hive> LOAD DATA LOCAL INPATH '/home/cloudera/LoanStats_2007-2011.csv' OVERWRITE INTO TABLE LOAN_STATS_YEAR07_YEAR11;
Copying data from file:/home/cloudera/LoanStats_2007-2011.csv
Copying file: file:/home/cloudera/LoanStats_2007-2011.csv
Loading data to table bank_loan.loan_stats_year07_year11
Deleted hdfs://localhost/user/hive/warehouse/bank_loan.db/loan_stats_year07_year11
OK
Time taken: 5.96 seconds


Fig 9. Loading data into table "LOAN_STATS_YEAR07_YEAR11"


hive> LOAD DATA LOCAL INPATH '/home/cloudera/LoanStats_2012-2013.csv' OVERWRITE INTO TABLE LOAN_STATS_YEAR12_YEAR13;
Copying data from file:/home/cloudera/LoanStats_2012-2013.csv
Copying file: file:/home/cloudera/LoanStats_2012-2013.csv
Loading data to table bank_loan.loan_stats_year12_year13
Deleted hdfs://localhost/user/hive/warehouse/bank_loan.db/loan_stats_year12_year13
OK
Time taken: 37.63 seconds
hive>


Fig 10. Loading data into table "LOAN_STATS_YEAR12_YEAR13"


Time taken: 0.068 seconds
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201407122316_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201407122316_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201407122316_0001
2014-07-13 01:24:57,226 Stage-1 map = 0%, reduce = 0%
2014-07-13 01:25:16,112 Stage-1 map = 11%, reduce = 0%
2014-07-13 01:25:18,519 Stage-1 map = 37%, reduce = 0%
2014-07-13 01:25:20,899 Stage-1 map = 65%, reduce = 0%
2014-07-13 01:25:21,915 Stage-1 map = 67%, reduce = 0%
2014-07-13 01:25:30,995 Stage-1 map = 100%, reduce = 0%
2014-07-13 01:25:39,082 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201407122316_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201407122316_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201407122316_0002
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201407122316_0002
2014-07-13 01:25:43,816 Stage-2 map = 0%, reduce = 0%
2014-07-13 01:25:51,109 Stage-2 map = 100%, reduce = 0%
2014-07-13 01:26:02,180 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201407122316_0002


Fig 11. Execution
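
The log message "Number of reduce tasks not specified. Estimated from input data size: 1" reflects a simple rule: Hive divides the input size by hive.exec.reducers.bytes.per.reducer and caps the result at hive.exec.reducers.max. A hedged Python sketch of that rule follows. The default values used here (about 1 GB per reducer and a maximum of 999) are assumptions that match older Hive releases and differ between versions, and the input sizes are illustrative rather than measured.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate the reducer count from input size, as the Hive log describes.

    The defaults are assumptions taken from older Hive releases.
    """
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    # At least one reducer is used, and never more than the configured cap.
    return max(1, min(max_reducers, estimate))


if __name__ == "__main__":
    print(estimate_reducers(50 * 1024 * 1024))  # small input
    print(estimate_reducers(5_500_000_000))     # larger input
```

Under these assumptions any input smaller than the per-reducer threshold yields a single reducer, which is consistent with the estimate of 1 reported in the log.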

Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201407122316_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201407122316_0003
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201407122316_0003
2014-07-13 01:33:27,292 Stage-1 map = 0%, reduce = 0%
2014-07-13 01:33:36,122 Stage-1 map = 19%, reduce = 0%
2014-07-13 01:33:38,424 Stage-1 map = 55%, reduce = 0%
2014-07-13 01:33:39,492 Stage-1 map = 67%, reduce = 0%
2014-07-13 01:33:46,620 Stage-1 map = 100%, reduce = 0%
2014-07-13 01:33:52,554 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201407122316_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:


Fig 12. Execution


In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201407122316_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201407122316_0004
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201407122316_0004
2014-07-13 01:33:58,358 Stage-2 map = 0%, reduce = 0%
2014-07-13 01:33:59,364 Stage-2 map = 100%, reduce = 0%
2014-07-13 01:34:08,411 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201407122316_0004
Moving data to: hdfs://localhost/user/hive/warehouse/bank_loan.db/loan_processed_successfully
OK
Time taken: 45.345 seconds
hive>


Fig 13. Execution
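
Execution time per stage can be read directly from the progress lines in logs such as those in Figs 11 to 13. The Python sketch below parses lines of the form "2014-07-13 01:24:57,226 Stage-1 map = 0%, reduce = 0%" and reports the elapsed time between the first and last progress line of each stage. The function name stage_durations is hypothetical, and the sample lines are Stage-1 entries from Fig 11 with the spacing normalised.

```python
import re
from datetime import datetime

# Matches a Hive progress line: timestamp, stage number, map % and reduce %.
LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+Stage-(\d+)\s+"
    r"map = (\d+)%,\s+reduce = (\d+)%"
)


def stage_durations(log_lines):
    """Return {stage number: seconds between its first and last progress line}."""
    first, last = {}, {}
    for line in log_lines:
        match = LINE.search(line)
        if not match:
            continue  # skip lines that are not progress reports
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        stage = int(match.group(2))
        first.setdefault(stage, stamp)
        last[stage] = stamp
    return {s: (last[s] - first[s]).total_seconds() for s in first}


if __name__ == "__main__":
    log = [
        "2014-07-13 01:24:57,226 Stage-1 map = 0%, reduce = 0%",
        "2014-07-13 01:25:30,995 Stage-1 map = 100%, reduce = 0%",
        "2014-07-13 01:25:39,082 Stage-1 map = 100%, reduce = 100%",
        "Ended Job = job_201407122316_0001",
    ]
    print(stage_durations(log))
```

Collecting these durations for runs on clusters of different sizes gives the execution-time-versus-nodes comparison that the performance analysis relies on.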


VI. Conclusion and future scope 

6.1 Conclusion 

In the present work, data mining is performed using the Hadoop ecosystem. Specifically, data mining is performed on large
loan data sets using the Hive component of the Hadoop ecosystem. Hive queries are used to mine useful data, for example:
"We are pulling those bank customers whose loans were processed successfully from 2007 to 2013. They proved to be the best
customers for banks, as their payment schedules and other incomes were verified and payments were made on time. They were
given loans at a rate of interest (ROI) of 7% for a duration of 3 years." This information can be mined in less time because
of the parallelization offered by the Hadoop ecosystem.

From our data analytics we conclude that these customers are the best market for banks in future and can be given priority
over other customers. Companies can create a separate operational data store (ODS) to keep an inventory of these customers
for faster search and loan processing.

6.2 Future Work 

In the present work, we have performed data mining on large structured data (big data) by executing Hive queries on the
Hive component of the Hadoop ecosystem.

Our future work will focus on the analysis or mining of unstructured data such as images, audio, video, and graphs using
the MapReduce component of the Hadoop ecosystem.

The results obtained from the system show that effective data mining has been performed.